Optimize 1x1 convolution for Network-in-Network style operation #1118
Conversation
// Special case: im2col is the identity for 1x1 convolution w/ stride 1,
// so flag for skipping the buffer and transformation.
is_1x1_ = kernel_w_ == 1 && kernel_h_ == 1
    && stride_h_ == 1 && stride_w_ == 1;
We also need to check that there is zero padding, yes?
1x1 convolution with stride 1 is a special case of Caffe matrix multiplication convolution for which im2col / col2im transformations are actually the identity. For this special case the memory and transformation are skipped.
Sorry, I was thinking of the 3x3 case. No need for padding.

Sergio
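To make the special case concrete, below is a small standalone check that the column buffer comes out element-for-element equal to the input when the kernel is 1x1 with stride 1 and zero padding. The im2col here is a simplified re-implementation of the usual (channels * kernel_h * kernel_w) x (out_h * out_w) row-major layout, written only for illustration; it is not Caffe's own code.

```cpp
// Standalone sanity check: with a 1x1 kernel, stride 1, and zero padding,
// im2col writes out exactly the input, in the same order.
#include <cassert>
#include <vector>

void im2col(const std::vector<float>& im, int channels, int height, int width,
            int kernel_h, int kernel_w, int pad_h, int pad_w,
            int stride_h, int stride_w, std::vector<float>& col) {
  const int out_h = (height + 2 * pad_h - kernel_h) / stride_h + 1;
  const int out_w = (width + 2 * pad_w - kernel_w) / stride_w + 1;
  const int channels_col = channels * kernel_h * kernel_w;
  col.assign(static_cast<size_t>(channels_col) * out_h * out_w, 0.f);
  for (int c = 0; c < channels_col; ++c) {
    const int w_off = c % kernel_w;
    const int h_off = (c / kernel_w) % kernel_h;
    const int c_im = c / kernel_w / kernel_h;
    for (int h = 0; h < out_h; ++h) {
      for (int w = 0; w < out_w; ++w) {
        const int h_im = h * stride_h - pad_h + h_off;
        const int w_im = w * stride_w - pad_w + w_off;
        if (h_im >= 0 && h_im < height && w_im >= 0 && w_im < width)
          col[(c * out_h + h) * out_w + w] =
              im[(c_im * height + h_im) * width + w_im];
      }
    }
  }
}

int main() {
  const int channels = 3, height = 4, width = 5;
  std::vector<float> im(channels * height * width);
  for (size_t i = 0; i < im.size(); ++i) im[i] = static_cast<float>(i);

  std::vector<float> col;
  im2col(im, channels, height, width, /*kernel*/ 1, 1, /*pad*/ 0, 0,
         /*stride*/ 1, 1, col);
  assert(col == im);  // 1x1, stride 1, no padding: the buffer is the input

  // With padding the buffer grows and gains a zero border, so it is no longer
  // the identity -- hence the zero-padding requirement in the final description.
  im2col(im, channels, height, width, 1, 1, /*pad*/ 1, 1, 1, 1, col);
  assert(col != im);
  return 0;
}
```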
Dtype* col_diff = NULL;
if (!is_1x1_) {
  col_data = col_buffer_.mutable_cpu_data();
  col_diff = col_buffer_.mutable_cpu_diff();
By the way... could we save memory in the usual case by changing this line to col_buffer_.mutable_cpu_data()
(i.e., by reusing the same buffer for both data and diff)? Perhaps I have missed something, but I don't see any reason in the code below why we need two separate buffers...
Good catch -- there's no need for the two at once, since col_data is only for the gradient w.r.t. the weight while col_diff is only for the gradient w.r.t. the bottom. Should we parallelize these in the future, separate buffers will be needed, but that can be adjusted when we cross that bridge. Check out the follow-up commit.
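Since the commit message below notes that consolidating the two "saves a lazy allocation", here is a toy sketch of why: Caffe-style blobs materialize their data and diff arrays only on first use, so a backward pass that routes everything through the data half never pays for the diff half. The LazyBlob class is a simplified stand-in for illustration, not Caffe's Blob / SyncedMemory.

```cpp
// Toy lazily allocating blob: data and diff are only allocated on first use.
#include <cstdio>
#include <vector>

class LazyBlob {
 public:
  explicit LazyBlob(size_t count) : count_(count) {}
  float* mutable_data() {               // allocated on first request
    if (data_.empty()) data_.resize(count_);
    return data_.data();
  }
  float* mutable_diff() {               // allocated on first request
    if (diff_.empty()) diff_.resize(count_);
    return diff_.data();
  }
  size_t bytes_allocated() const {
    return (data_.size() + diff_.size()) * sizeof(float);
  }

 private:
  size_t count_;
  std::vector<float> data_, diff_;
};

int main() {
  const size_t col_count = 1 << 20;  // stand-in for an im2col buffer size

  LazyBlob separate(col_count);      // original: data for the weight gradient,
  separate.mutable_data();           // diff for the bottom gradient
  separate.mutable_diff();

  LazyBlob shared(col_count);        // consolidated: both stages reuse the
  shared.mutable_data();             // data half, one after the other
  shared.mutable_data();

  std::printf("separate buffers: %zu bytes, shared buffer: %zu bytes\n",
              separate.bytes_allocated(), shared.bytes_allocated());
  return 0;
}
```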
Looks pretty good. It might be worth a comment near the
This reminds me that we should recover the shared_col_buffer across layers. Any suggestions about which class should be responsible for providing them?

Sergio
@sguada seems to me that Net should broker shared blobs as requested; then each layer can reshape them on-the-fly as needed. The memory is shared across layers but still owned by the Net and will be freed along with the Net. @longjon's PR lets the blobs grow to the largest size needed. Could be worth a try for fully-convolutional models in the regime where Caffe's matrix multiplication is faster than cuDNN (at present).
conv forward / backward only need one of the im2col data and diff at a time, so consolidating the two saves a lazy allocation.
Optimize 1x1 convolution for Network-in-Network style operation
Awesome, this looks perfect to me. Thanks @shelhamer for writing this nice tight optimization (and being super responsive!)
@shelhamer I think we could do the same trick in the case that the filters have the same size as the bottoms, and no padding, so no stride would be needed and only one matrix multiplication is needed. Useful to replace fully connected layers with convolutions.
@sguada yes, the fully-connected case (bottom dimensions = filter dimensions) could be special-cased the same way. The other optimization is to allow batched im2col / col2im when memory allows.
@sguada @shelhamer for the column buffer to be identical to the input, we must have

(pad_h == 0 && pad_w == 0)
&& ((stride_w == kernel_w && width % kernel_w == 0 && kernel_h == 1)
|| (width == kernel_w && ((stride_h == kernel_h && height % kernel_h == 0)
|| height == kernel_h)))

which is rather more general than both the special cases discussed so far. Re: batched buffers: you would also get this for free in the above case. I wonder how much of a difference it makes, though?
@longjon I think there are some symmetric cases missing in that formula, i.e.:

How about this formula:
@sguada No, it's trickier than that, and not symmetric in the way you are thinking, because row-major order goes left-to-right, up-to-down. You might think that you could get the same optimization for the column-major contiguous cases by changing the transposition parameters, but I think that cannot be done because of the channel dimension.
@longjon yeah, you are right, I forgot to consider the asymmetry introduced by the row-major order.
@sguada I believe your expression is correct, but I think it's less clear to include redundant cases. One way or another, I think the expression should be explained by comments (although one really has to have the right picture to understand it). E.g., I would suggest writing:

// For the column buffer to be identical to the input, we must have...
// zero padding, plus...
(pad_h == 0 && pad_w == 0) &&
// the kernel must tile the input horizontally, and have height one...
((stride_w == kernel_w && width % kernel_w == 0 && kernel_h == 1)
// unless it takes the whole width of the input, in which case it must
// tile the input vertically, or take the whole height of the input!
|| (width == kernel_w && ((stride_h == kernel_h && height % kernel_h == 0)
|| height == kernel_h)))
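Conditions like this are easy to get subtly wrong, so brute-forcing small shapes is a handy sanity check. The sketch below (an illustration, not Caffe code) enumerates small zero-padding configurations and prints those for which the column buffer, in the usual Caffe layout of (channels * kernel_h * kernel_w) rows by (out_h * out_w) columns stored row-major, is element-for-element identical to the input; its output can be compared against expressions like the one above.

```cpp
// Brute-force search over small shapes for the cases where the column buffer
// equals the input: every read must stay inside the image (no padding) and
// the flat column index must match the flat image index for every element.
#include <cstdio>

bool is_identity(int C, int H, int W, int KH, int KW,
                 int PH, int PW, int SH, int SW) {
  const int OH = (H + 2 * PH - KH) / SH + 1;
  const int OW = (W + 2 * PW - KW) / SW + 1;
  if (OH < 1 || OW < 1) return false;
  if (C * KH * KW * OH * OW != C * H * W) return false;  // sizes must match
  for (int c = 0; c < C; ++c)
    for (int kh = 0; kh < KH; ++kh)
      for (int kw = 0; kw < KW; ++kw)
        for (int oh = 0; oh < OH; ++oh)
          for (int ow = 0; ow < OW; ++ow) {
            const int h_im = oh * SH - PH + kh;
            const int w_im = ow * SW - PW + kw;
            // a read outside the image would be a padding zero
            if (h_im < 0 || h_im >= H || w_im < 0 || w_im >= W) return false;
            const long col_idx =
                ((((long)c * KH + kh) * KW + kw) * OH + oh) * OW + ow;
            const long im_idx = ((long)c * H + h_im) * W + w_im;
            if (col_idx != im_idx) return false;
          }
  return true;
}

int main() {
  // Sweep small shapes with zero padding and report every identity case.
  const int C = 2;
  for (int H = 1; H <= 4; ++H)
    for (int W = 1; W <= 4; ++W)
      for (int KH = 1; KH <= H; ++KH)
        for (int KW = 1; KW <= W; ++KW)
          for (int SH = 1; SH <= 2; ++SH)
            for (int SW = 1; SW <= 2; ++SW)
              if (is_identity(C, H, W, KH, KW, /*pad*/ 0, 0, SH, SW))
                std::printf("identity: %dx%d input, %dx%d kernel, "
                            "stride %dx%d\n", H, W, KH, KW, SH, SW);
  return 0;
}
```

The same sweep covers the 1x1 case and the whole-input (fully-connected style) case discussed earlier.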
Optimize 1x1 convolution for Network-in-Network style operation
1x1 convolution with stride 1 and no padding is a special case of Caffe matrix multiplication convolution for which im2col / col2im transformations are actually the identity. For this special case the memory and transformation are skipped.
This optimizes the execution of 1x1 convolutions, i.e. NIN / CCCP convolutions.
@mavenlin